Abstract

We consider the problem of finding an n-agent joint-policy for the
optimal finite-horizon control of a decentralized Pomdp (Dec-Pomdp).
This is a problem of very high complexity (NEXP-hard for n ≥ 2). In
this paper, we propose a new mathematical programming approach
for the problem. Our approach is based on two ideas. First, we
represent each agent's policy in the sequence-form and not in the
tree-form, thereby obtaining a very compact representation of the set of
joint-policies. Second, using this compact representation, we solve
the problem as an instance of combinatorial optimization, for which
we formulate a mixed integer linear program (MILP). The optimal
solution of the MILP directly yields an optimal joint-policy for the
Dec-Pomdp. Computational experience shows that formulating and
solving the MILP takes significantly less time on benchmark
Dec-Pomdp problems than existing algorithms. For example, the
multi-agent tiger problem for horizon 4 is solved in 72 seconds with the
MILP, whereas existing algorithms require several hours to solve it.



1 Introduction 

In a finite-horizon Dec-Pomdp [1], a set of n agents cooperate to control a Markov
decision process for k steps under two constraints: partial observability and
decentralization. Partial observability signifies that the agents are imperfectly informed about
the state of the process during control. Decentralization signifies that the agents are
differently imperfectly informed during the control. The agents begin the control of
the process with the same, possibly imperfect, information about the state. During the
control each agent receives private information about the state of the process, which he
cannot divulge to the other agents. The agents' private information can have an impact
on what they collectively do. Thus, before the control begins, each agent must reason
not only about the possible states of the process during the control (as in a Pomdp) but
also about the information that could be held by the other agents during
the control. In effect, the agent must also reason about which policies the other agents
would use. Partial observability and decentralization make Dec-Pomdps very difficult
to solve. Finding an optimal solution to a Dec-Pomdp is NEXP-hard in the number of
agents [1]; finding a locally optimal solution to a Dec-Pomdp is NP-hard in the size of
the Dec-Pomdp problem (determined by k and the sizes of the sets of joint-actions and
joint-observations) [?].

1.1 Motivation for a new approach 

The three existing exact algorithms, DP [4], MAA* [11] and PBDP [10], are able to
solve only very small Dec-Pomdps in reasonable time (2 agents, horizon ≤ 4,
action and observation set sizes ≤ 3). Their lack of scalability is predictable from the
negative complexity results. Therefore, the question is not so much whether these
algorithms can be improved upon in the absolute, but rather whether a relative improvement
can be achieved. In other words, can we push the computational envelope a bit further
on this problem? In this paper, we present a new approach based on integer
programming, which performs much better in practice than the existing
algorithms. For instance, with our approach, the multi-agent Tiger problem [8] for
horizon 4 can be solved in 72 seconds, as against the few hours required by the PBDP
algorithm [10] (the only current algorithm able to solve this instance). Similarly, the
MABC problem [4] for horizon 5 is solved in 25 seconds, as against the $10^5$ seconds
required by PBDP. So we might tentatively answer the above question in the positive.
There is of course a more relevant reason for pushing this envelope. The three
algorithms serve as a basis for approximate algorithms such as Approximate-DP [3]
and MBDP [9], and these seem to scale to much longer horizons and to much larger
problems. So, a more efficient exact algorithm is important from this perspective as
well. We discuss this in more detail in the last section.

1.2 A new, mixed integer programming approach 

Existing Dec-Pomdp algorithms represent an agent's policy as a tree and a joint-policy
as a tuple of policy-trees. The size of the set of policy-trees of each agent is doubly
exponential in the horizon. Hence, the set of joint-policies is doubly exponential in the
horizon and exponential in the number of agents. This adversely impacts the space and
time requirements of the algorithms. In our approach we discard the tree
representation in favor of the sequence-form representation, which was introduced in a seminal
paper on computational game theory [6]. In the sequence-form, every finite-horizon
deterministic policy of an agent can be represented as a subset of the set of sequences of
actions and observations of the agent. The problem of finding an optimal deterministic
joint-policy is thus equivalent to the problem of finding for each agent a subset from
a larger set. This problem thus becomes an instance of combinatorial optimization
and we conceive a mixed integer linear program (MILP) to solve it. The key insight
of Koller's approach (and therefore of our approach) is that the size of the set of
sequences from which each subset is drawn is only exponential in the horizon and not doubly
exponential in it, as is the case with the size of the set of policy-trees. This allows
us to formulate an MILP whose size is exponential in k and n. For small problems
such as MA-Tiger and MABC, it is feasible to represent the MILP in memory.
Furthermore, and equally importantly, the constraint matrix of the MILP is sparse. The
consequence of this is that in practice the MILP is solved very quickly (on the order
of seconds). Thus, we have an effective method to compute an optimal deterministic
finite-horizon joint-policy. Restricting attention to deterministic joint-policies does not
limit the applicability of our approach in any way, since in every finite-horizon
Dec-Pomdp there exists at least one optimal joint-policy that is deterministic. It is also not
evident that relaxing this restriction has any benefit. Implicitly, existing algorithms also
restrict attention to deterministic joint-policies. In this paper 'policy' and 'joint-policy'
shall mean deterministic policy and deterministic joint-policy respectively unless
otherwise specified.

2 The finite-horizon Dec-Pomdp problem 

A finite-horizon Dec-Pomdp problem is defined by the following elements. We are
given $N$, a set of n agents, and $S$, a set of states. The n agents in $N$ are numbered from
1 to n. The states are numbered from 1 to $|S|$. For each agent i, we are given $A_i$, the
agent's set of actions, and $\Omega_i$, his set of observations. The cross-product
$A_1 \times A_2 \times \dots \times A_n$ is called the set of joint-actions and it is denoted by $A$. Similarly,
the cross-product $\Omega_1 \times \Omega_2 \times \dots \times \Omega_n$ is called the set of joint-observations and it is
denoted by $\Omega$. The joint-actions are numbered from 1 to $|A|$ and the joint-observations
are numbered from 1 to $|\Omega|$. Then, we are given, for each ath joint-action, the matrices
$T^a$, $Z^a$ and the vector $R^a$:

(a) $T^a_{ss'}$ is the probability of transitioning to the s'th state if the agents take the ath
joint-action in the sth state.

(b) $Z^a_{s'o}$ is the probability of the agents receiving the oth joint-observation and
transitioning to the s'th state if they take the ath joint-action.

(c) $R^a_s$ is the real-valued reward the agents obtain if they take the ath joint-action in
the sth state.

We are given $b_0$, which represents the initial belief state and is common knowledge
amongst the agents. A belief state is a probability distribution over $S$. In a belief state
b, the probability of the sth state is denoted by b[s]. Finally, we are given k ≥ 1, a finite
number that is the horizon of the control. The control of the Dec-Pomdp is described
as follows. At each step t of k steps: the agents take a joint-action, they receive a
joint-observation, they receive a common reward $r_t$, and the process transitions to a
new belief state as a function of the previous belief state, the joint-action and the
joint-observation. However, at each step, agents do not reveal to one another the actions they
take and observations they receive at that step or at previous steps. Since an agent does
not know the actions taken by the other agents and the observations received by the
other agents during the k steps, at each step he takes actions strictly as a function of
the actions he has taken previously and observations he has received previously. This
function is called his policy. To control the Dec-Pomdp for k steps, each agent requires
a k-step policy, henceforth written as k-policy. The tuple of the agents' policies forms
a joint-policy. An optimal joint-policy is one which maximizes $E(\sum_{t=1}^{k} r_t)$, the sum
of expected rewards the agents obtain for the k steps.

Figure 1: A 3-step policy $\psi$.

2.1 Policy in the tree-form 

The canonical representation of a policy, used in existing Dec-Pomdp algorithms, is the
tree-form. In this form, a k-policy of the ith agent can be represented as a rooted tree
with k levels in which each non-terminal node has $|\Omega_i|$ children. This tree is called a
k-policy-tree. Each node is labeled by an action to take and each edge is labeled by an
observation that may occur. Using a policy-tree, during the control, the agent follows
a path from the root to a leaf depending on the observations he receives. An example
of a policy-tree is shown in Figure 1. The number of nodes in a k-policy-tree of the ith
agent is $\frac{|\Omega_i|^k - 1}{|\Omega_i| - 1}$. It is thus exponential in k. For example, with $|\Omega_i| = 2$, a
3-policy-tree, such as the one shown in Figure 1, has $\frac{2^3 - 1}{2 - 1} = 7$ nodes. The set of
k-policy-trees of the ith agent is the set of all the $\frac{|\Omega_i|^k - 1}{|\Omega_i| - 1}$-sized permutations of the
actions in $A_i$. Therefore, the size of the set of k-policy-trees of the ith agent is
$|A_i|^{\frac{|\Omega_i|^k - 1}{|\Omega_i| - 1}}$, doubly exponential in k.
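To make the growth concrete, the two counts above can be evaluated directly. The following is a small illustrative sketch; the function names and the example sizes (3 actions, 2 observations) are our own choices and are not taken from the benchmark definitions.

```python
def num_nodes(num_obs: int, k: int) -> int:
    """Number of nodes in a k-policy-tree with num_obs children per inner node."""
    # Level t (t = 1..k) holds num_obs**(t-1) nodes. Summing the levels gives
    # the closed form (num_obs**k - 1) / (num_obs - 1) and also works when
    # num_obs == 1, where the closed form would divide by zero.
    return sum(num_obs ** (t - 1) for t in range(1, k + 1))


def num_policy_trees(num_actions: int, num_obs: int, k: int) -> int:
    """Number of distinct k-policy-trees: one action label per node."""
    return num_actions ** num_nodes(num_obs, k)


if __name__ == "__main__":
    # Assumed sizes for illustration: |A_i| = 3, |Omega_i| = 2.
    for k in range(1, 5):
        print(k, num_nodes(2, k), num_policy_trees(3, 2, k))
```

With these assumed sizes the node count merely doubles (plus one) per extra step, while the number of trees is raised to roughly its square each time.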

3 Policy in the sequence-form 

The double exponentiality associated with the set of policy-trees can be avoided by 
using the sequence-form representation of a policy. We begin a description of this 
representation by defining a sequence. 

Definition 1 A sequence of length t of the ith agent is an ordered list of 2t - 1 elements,
t ≥ 1, in which the elements in odd positions are actions from $A_i$ and those in even
positions are observations from $\Omega_i$.

Thus, in a sequence of length t there are t actions and t - 1 observations. The shortest
possible sequence is of length 1, which consists of just an action and no observations.
We denote the set of all possible sequences of length t which can be conceived from
$A_i$ and $\Omega_i$ by $\mathcal{S}_i^t$. We denote the set $\mathcal{S}_i^1 \cup \mathcal{S}_i^2 \cup \dots \cup \mathcal{S}_i^k$ by $\mathcal{S}_i$. We shall now see how a
k-policy can be represented as a set of sequences, or more precisely as a subset of $\mathcal{S}_i$.
Assume that k = 3 and that the policy-tree $\psi$ shown in Figure 1 is of the ith agent. Starting
from the root-node and descending down the edges of the tree, we can enumerate the
sequences of this tree. The first sequence we obtain is in the root-node itself, the
sequence consisting of the action c and no observations. This is a sequence of length
1. Then, going down the edge labeled by u from the root-node, we come to the node
labeled by the action f. At this point, we obtain a second sequence cuf, which is of
length 2. It has two actions and one observation. Similarly, taking the other edge from
the root-node, we come to the node labeled by d and obtain a third sequence cvd, also
of length 2. When all the leaves of the tree have been visited, the set of sequences we
obtain is,

$$\mathcal{S}(\psi) = \{c,\ cuf,\ cvd,\ cufuc,\ cufvf,\ cvdud,\ cvdvc\}$$

This set contains 1 sequence of length 1, 2 sequences of length 2 and 4 sequences of
length 3, giving a total of 7 sequences corresponding to the 7 nodes in $\psi$. It is evident
that the set $\mathcal{S}(\psi)$ is equivalent to the policy-tree $\psi$. That is, given the set $\mathcal{S}(\psi)$, the
agent can use it as a 3-step policy. As this simple exercise shows, any finite-step policy
can be written as a finite set of sequences. Now, $\mathcal{S}(\psi)$ is a subset of $\mathcal{S}_i$, the set of all
possible sequences of lengths less than or equal to 3, and so is every 3-policy of the ith
agent. Thus, for any given value of k, every k-policy of the ith agent is a subset of $\mathcal{S}_i$.
This is the main idea of the sequence-form representation of a policy.
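The enumeration just described is a plain depth-first traversal. The sketch below is illustrative only: the nested-tuple encoding of a policy-tree and the names `Node`, `sequences_of` and `PSI` are our assumptions, and the tree shape is read off the sequences listed in the text.

```python
from typing import Dict, List, Tuple

# A policy-tree node is (action, {observation: child}); a leaf has no children.
Node = Tuple[str, Dict[str, "Node"]]


def sequences_of(tree: Node, prefix: str = "") -> List[str]:
    """Return every sequence of the tree, writing each as concatenated labels."""
    action, children = tree
    seq = prefix + action              # the sequence that ends in this node
    result = [seq]
    for obs, child in children.items():
        result.extend(sequences_of(child, seq + obs))
    return result


# The 3-step policy of the running example.
PSI: Node = (
    "c",
    {
        "u": ("f", {"u": ("c", {}), "v": ("f", {})}),
        "v": ("d", {"u": ("d", {}), "v": ("c", {})}),
    },
)

if __name__ == "__main__":
    print(sorted(sequences_of(PSI), key=len))
```

Each node contributes exactly one sequence, which is why the number of sequences of a policy equals the number of nodes of its tree.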

3.1 Policy as a vector 

We can streamline the subset-set relationship between a k-policy and $\mathcal{S}_i$ by
representing the former as a vector of binary values. Let the sequences in $\mathcal{S}_i$ be numbered from
1 to $|\mathcal{S}_i|$. Since every k-policy of the ith agent is a subset of $\mathcal{S}_i$, every sequence in $\mathcal{S}_i$
is either in the policy or it is not. Thus a k-policy of the ith agent can be represented
as a $|\mathcal{S}_i|$-vector of binary values 0 or 1, such that if the jth sequence in $\mathcal{S}_i$ is in the
policy then the jth element of the vector equals 1 and if it is not, then the jth element
of the vector equals 0. Let the set of $|\mathcal{S}_i|$-vectors of binary values 0 or 1 be denoted by
$X_i$. Thus every k-policy of the ith agent is a member of the set $X_i$. Let p be the jth
sequence in $\mathcal{S}_i$. For a vector $x_i \in X_i$, the value of the jth element in $x_i$ shall be conveniently
represented as $x_i[p]$.
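As an illustration of this encoding, the sketch below enumerates $\mathcal{S}_i$ and builds the 0/1 vector of the running example. It is a hedged sketch: the helper names are ours, the vector is stored as a dictionary keyed by sequence (so that `x[p]` plays the role of $x_i[p]$), and the action set {c, d, f} and observation set {u, v} are assumed from the example.

```python
from itertools import product
from typing import Dict, List, Sequence


def all_sequences(actions: Sequence[str], observations: Sequence[str], k: int) -> List[str]:
    """Enumerate S_i: every sequence of length 1..k, shortest first."""
    seqs: List[str] = []
    for t in range(1, k + 1):
        for acts in product(actions, repeat=t):
            for obs in product(observations, repeat=t - 1):
                parts = [acts[0]]
                for o, a in zip(obs, acts[1:]):
                    parts += [o, a]            # interleave: a1 o1 a2 o2 ... at
                seqs.append("".join(parts))
    return seqs


def policy_vector(policy: Sequence[str], universe: Sequence[str]) -> Dict[str, int]:
    """Map every sequence p of the universe to 1 if p is in the policy, else 0."""
    chosen = set(policy)
    return {p: int(p in chosen) for p in universe}


if __name__ == "__main__":
    S_i = all_sequences("cdf", "uv", 3)
    psi = ["c", "cuf", "cvd", "cufuc", "cufvf", "cvdud", "cvdvc"]
    x = policy_vector(psi, S_i)
    print(len(S_i), sum(x.values()))
```

Only 7 of the entries are 1; the vector is long but very sparse.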

3.2 Policy constraints of the ith agent 

Thus, every k-policy of the ith agent is a member of $X_i$. The converse is of course
untrue; not every member of $X_i$ is a k-policy. We therefore need to define which
vectors in $X_i$ can represent a k-policy. We shall give a more general definition, one that
includes stochastic policies as well as deterministic policies. We shall in fact define
which vectors in $\mathbb{R}^{|\mathcal{S}_i|}$ represent a k-step policy, be it a stochastic policy or a
deterministic one. The definition takes the form of a system of linear equations which must be
satisfied by a vector in $\mathbb{R}^{|\mathcal{S}_i|}$ if it is to represent a k-policy. Given a sequence p, an
action a and an observation o, let poa denote the sequence obtained on appending o
and a to the end of p. Only sequences of length less than k, that is, those in
$\mathcal{S}_i^1 \cup \mathcal{S}_i^2 \cup \dots \cup \mathcal{S}_i^{k-1}$, can be extended in this way.

Definition 2 Let $|\mathcal{S}_i| = z$. A vector $w \in \mathbb{R}^z$ is a k-step, possibly stochastic, policy of
the ith agent if,

$$\sum_{a \in A_i} w[a] = 1 \tag{1}$$

$$w[p] - \sum_{a \in A_i} w[poa] = 0, \quad \forall t \in \{1, \dots, k-1\},\ \forall p \in \mathcal{S}_i^t,\ \forall o \in \Omega_i \tag{2}$$

$$w \ge 0 \tag{3}$$

We call the system of linear equations (1)-(3) the policy constraints of the ith agent.
Policy constraints recreate the tree structure of a policy. They appear in a slightly
different form as Lemma 5.1 in [6]. We can write the policy constraints in matrix
form as $C_i w = b_i$, $w \ge 0$, where $C_i$ is the matrix of the coefficients of the variables
in the equations (1)-(2) and $b_i$ is a vector of appropriate length whose first element is 1
and whose remaining elements are 0, representing the r.h.s. of the equations. Note that it is
implicit in the above definition that the value of each element of w is constrained to be
in the interval [0,1]. Hence, we can define a deterministic k-policy of the ith agent as
follows.

Definition 3 A vector $x_i \in X_i$ is a k-policy of the ith agent if $C_i x_i = b_i$.

We shall call a policy represented as a vector a policy-vector, just to distinguish
it from a policy-tree. The representation of a policy as a policy-vector is in fact the
sequence-form representation we have been alluding to. Given a vector $x_i \in X_i$
which satisfies the policy constraints, the agent can use it just as he would use a
policy-tree, without requiring any additional book-keeping. Let choosing a sequence
mean taking the last action in the sequence. In using $x_i$, at the first step, he chooses the
action a such that $x_i[a] = 1$. There will be only one such action. Then on receiving an
observation, say o, he chooses the sequence aoa' such that $x_i[aoa'] = 1$. Again there
will be only one such sequence. In general, if at step t he has chosen the sequence p
and then received the observation o, then he chooses the unique sequence poa'' such
that $x_i[poa''] = 1$ at the (t + 1)th step. Thus, at each step, the agent must know the
sequence of actions he has taken and the sequence of observations he has received till
that step in order to know which action to take according to $x_i$. This requirement is
called perfect recall in game theory, and it is implicit in the use of a policy-tree.
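The two uses of a policy-vector described in this section, checking the policy constraints and reading off the next action, can be sketched as follows. This is an illustrative sketch under our own conventions: sequences are strings of single-character labels, the vector is a dictionary in which missing sequences count as 0, and the function names are hypothetical.

```python
from typing import Dict, Sequence


def satisfies_policy_constraints(x: Dict[str, float], actions: Sequence[str],
                                 observations: Sequence[str], k: int,
                                 tol: float = 1e-9) -> bool:
    """Check constraints (1)-(3) for a vector x indexed by sequence strings."""
    if any(v < -tol for v in x.values()):                        # (3)
        return False
    if abs(sum(x.get(a, 0.0) for a in actions) - 1.0) > tol:     # (1)
        return False
    frontier = list(actions)                 # all sequences of length t
    for _ in range(1, k):                    # t = 1 .. k-1
        longer = []
        for p in frontier:
            for o in observations:
                kids = [p + o + a for a in actions]
                total = sum(x.get(c, 0.0) for c in kids)
                if abs(x.get(p, 0.0) - total) > tol:             # (2)
                    return False
                longer.extend(kids)
        frontier = longer
    return True


def next_action(x: Dict[str, float], history: str, observation: str,
                actions: Sequence[str]) -> str:
    """Return the unique action a with x[history + observation + a] = 1."""
    prefix = history + observation
    picks = [a for a in actions if x.get(prefix + a, 0) == 1]
    if len(picks) != 1:
        raise ValueError("x is not a deterministic policy at this history")
    return picks[0]


# Policy-vector of the running example, storing only the entries equal to 1.
X_PSI = {p: 1 for p in ["c", "cuf", "cvd", "cufuc", "cufvf", "cvdud", "cvdvc"]}

if __name__ == "__main__":
    print(satisfies_policy_constraints(X_PSI, "cdf", "uv", 3))
    print(next_action(X_PSI, "", "", "cdf"), next_action(X_PSI, "c", "u", "cdf"))
```

The history string passed to `next_action` is exactly the information that perfect recall requires the agent to keep.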

3.3 Advantage of the sequence-form representation 

The size of $\mathcal{S}_i^t$ is $|A_i|^t |\Omega_i|^{t-1}$. The size of $\mathcal{S}_i$ is thus $\sum_{t=1}^{k} |A_i|^t |\Omega_i|^{t-1}$, exponential in
k. Since every k-policy is in theory available if the set $\mathcal{S}_i$ is available, the latter serves
as a search space for k-policies of the ith agent. The good news is of course that this
search space is only exponential in k. This compares favorably with the search space
represented by the set of k-policy-trees, which is doubly exponential in k. We thus
have at our disposal an exponentially smaller space in which to search for an agent's
policy. More precisely, to find a k-policy of the ith agent, we need to set up and solve
the system of policy constraints. The number of equations in this system is
$c_i = 1 + \sum_{t=1}^{k-1} |A_i|^t |\Omega_i|^t$. $C_i$ is thus a $c_i \times |\mathcal{S}_i|$ matrix. Now notice that $C_i$ is a sparse matrix,
that is, it has only a very small number of nonzero entries per row or column, while
most of its entries are 0s. In $C_i$, the number of nonzero entries per row is at most $1 + |A_i|$,
a bound that does not grow with k. Sparse systems are typically easier to solve than dense
systems of the same size. The relatively small size of $C_i$ and its sparsity combine to
form a relatively efficient method to find a k-policy of the ith agent.
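As a worked illustration of these sizes, take $|A_i| = 3$ and $|\Omega_i| = 2$ (values assumed here purely for the sake of the arithmetic) and compare the number of sequences, the number of policy-constraint equations and the number of policy-trees for k = 3 and k = 4:

```latex
\begin{align*}
k = 3:\quad |\mathcal{S}_i| &= 3 \cdot 2^0 + 3^2 \cdot 2^1 + 3^3 \cdot 2^2 = 3 + 18 + 108 = 129, \\
            c_i &= 1 + 3 \cdot 2 + 3^2 \cdot 2^2 = 1 + 6 + 36 = 43, \\
            \text{policy-trees} &= 3^{(2^3 - 1)/(2 - 1)} = 3^{7} = 2187, \\[4pt]
k = 4:\quad |\mathcal{S}_i| &= 129 + 3^4 \cdot 2^3 = 129 + 648 = 777, \\
            c_i &= 43 + 3^3 \cdot 2^3 = 43 + 216 = 259, \\
            \text{policy-trees} &= 3^{(2^4 - 1)/(2 - 1)} = 3^{15} = 14\,348\,907 .
\end{align*}
```

One extra step multiplies the number of sequences by about six, but takes the number of policy-trees from thousands to tens of millions.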

4 Value of a Joint-policy 

The agents control the finite-horizon Dec-Pomdp by a k-step joint-policy, henceforth
written as a k-joint-policy. A joint-policy is just the tuple formed by the agents'
individual policies. Thus, a k-joint-policy is an n-tuple of k-policies. A k-joint-policy
may be an n-tuple of k-policy-trees or it may be an n-tuple of k-policy-vectors. Given
a joint-policy $\pi$ in either representation, the policy of the ith agent in it shall be denoted
by $\pi_i$. A joint-policy is evaluated by computing its value. The value of a joint-policy
represents the sum of expected rewards the agents obtain if it is executed starting from
the given initial belief state $b_0$. The value of a joint-policy $\pi$ shall be denoted by $V(\pi)$.

4.1 Value of a joint-policy as an n-tuple of policy-trees 

Given a t-policy $\sigma$ of an agent, t ≤ k, let $a(\sigma)$ denote the action in the root node of
$\sigma$ and let $\sigma(o')$ denote the sub-tree attached to the root-node of $\sigma$ into which the edge
labeled by the observation o' enters. Furthermore, given a t-joint-policy $\pi$, let $a(\pi)$
denote the joint-action $(a(\pi_1), a(\pi_2), \dots, a(\pi_n))$ and, given a joint-observation o, let
$\pi(o)$ denote the (t - 1)-joint-policy $(\pi_1(o_1), \pi_2(o_2), \dots, \pi_n(o_n))$. Now let $\pi$ be a
k-joint-policy which is an n-tuple of k-policy-trees. The value of $\pi$ is expressed in terms
of the k-step value-function of the Dec-Pomdp, denoted by $V^k$, as follows,

$$V(\pi) = \sum_{s \in S} b_0[s]\, V^k(s, \pi) \tag{4}$$

in which $V^k$ is expressed recursively as,

$$V^k(s, \pi) = R^{a(\pi)}_s + \sum_{o \in \Omega} \sum_{s' \in S} T^{a(\pi)}_{ss'}\, Z^{a(\pi)}_{s'o}\, V^{k-1}(s', \pi(o)) \tag{5}$$

For t = 1, $V^t(s, a) = R^a_s$. An optimal k-joint-policy is one whose value is the
maximum.
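The recursion (4)-(5) translates almost literally into code. The following sketch is illustrative only: the data layout (dictionaries keyed by joint-action tuples, nested-tuple policy-trees) and the two-state toy model are our assumptions, not one of the benchmark problems.

```python
from typing import Dict, Tuple

# A policy-tree node: (action, {observation: subtree}); a leaf has no subtrees.
Node = Tuple[int, Dict[int, "Node"]]


def tree_value(s, trees, T, Z, R, joint_obs):
    """V^t(s, pi) of equation (5) for a tuple of policy-trees, one per agent."""
    a = tuple(tree[0] for tree in trees)               # joint-action a(pi)
    value = R[a][s]
    if not trees[0][1]:                                # leaves reached: t = 1
        return value
    for o in joint_obs:
        subtrees = tuple(tree[1][o_i] for tree, o_i in zip(trees, o))   # pi(o)
        for s2, p_trans in enumerate(T[a][s]):
            weight = p_trans * Z[a][s2][o]
            if weight > 0.0:
                value += weight * tree_value(s2, subtrees, T, Z, R, joint_obs)
    return value


def joint_policy_value(b0, trees, T, Z, R, joint_obs):
    """V(pi) of equation (4)."""
    return sum(b0[s] * tree_value(s, trees, T, Z, R, joint_obs)
               for s in range(len(b0)))


# Hypothetical toy model: 2 states, 2 agents, actions and observations in {0, 1}.
STATES = (0, 1)
JOINT_ACTIONS = [(a1, a2) for a1 in (0, 1) for a2 in (0, 1)]
JOINT_OBS = [(o1, o2) for o1 in (0, 1) for o2 in (0, 1)]
T = {a: [[1.0, 0.0], [0.0, 1.0]] for a in JOINT_ACTIONS}       # the state never changes
Z = {a: [{o: float(o == (s, s)) for o in JOINT_OBS} for s in STATES]
     for a in JOINT_ACTIONS}                                   # both agents observe the state
R = {a: [float(a == (s, s)) for s in STATES]
     for a in JOINT_ACTIONS}                                   # reward 1 if both actions equal the state

LEAF0, LEAF1 = (0, {}), (1, {})
TREE = (0, {0: LEAF0, 1: LEAF1})        # play 0, then play the observed state

if __name__ == "__main__":
    print(joint_policy_value([0.5, 0.5], (TREE, TREE), T, Z, R, JOINT_OBS))
```

In this toy model the first step earns 0.5 in expectation and the second step always earns 1, so the 2-step joint-policy has value 1.5.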

4.2 Value of a joint-policy as an n-tuple of policy-vectors 

The value of a k-joint-policy that is an n-tuple of policy-vectors is expressed in terms
of the values of its joint-sequences. A joint-sequence is defined analogously to a
sequence.

Definition 4 A joint-sequence of length t is an ordered list of 2t - 1 elements, t ≥ 1, in
which the elements in odd positions are joint-actions from $A$ and those in even positions
are joint-observations from $\Omega$.

Equivalently, we can also define a joint-sequence of length t as an n-tuple of sequences
of length t. Given a joint-sequence q, the sequence of the ith agent in q shall be denoted
by $q_i$. The set of joint-sequences of length t, denoted by $\mathcal{S}^t$, is thus the cross-product
set $\mathcal{S}_1^t \times \mathcal{S}_2^t \times \dots \times \mathcal{S}_n^t$. Given a joint-sequence q of length t, the jth joint-action
in it (j ≤ t) shall be denoted by $a_q^j$ and the hth joint-observation in it (h < t) shall be denoted
by $o_q^h$. We now define the value of a joint-sequence.



4.3 Joint-sequence value 

The value of a joint-sequence q of length t, denoted by v(q), is independent of any
joint-policy. It is simply a property of the Dec-Pomdp model. It is a product of two
quantities: p(q), the probability of q occurring, and $\mathcal{R}(q)$, the sum of expected rewards
the joint-actions in q obtain:

$$v(q) = p(q)\,\mathcal{R}(q) \tag{6}$$

These quantities are defined and computed as follows. p(q) is the probability,

$$p(q) = \Pr(o_q^1, o_q^2, \dots, o_q^{t-1} \mid b_0, a_q^1, a_q^2, \dots, a_q^{t-1}) \tag{7}$$

$$\phantom{p(q)} = \prod_{j=1}^{t-1} \Pr(o_q^j \mid b_{j-1}^q)$$

where $b_{j-1}^q$ is a belief state which, if computed as follows, serves as a sufficient statistic
for the joint-sequence $(a_q^1, o_q^1, \dots, a_q^{j-1}, o_q^{j-1})$. Let o denote $o_q^j$ and a denote $a_q^j$. Let
$b_{j-1}^q$ be given. Then,

$$\Pr(o \mid b_{j-1}^q) = \sum_{s \in S} \sum_{s' \in S} b_{j-1}^q[s]\, T^a_{ss'}\, Z^a_{s'o} \tag{10}$$

and $b_j^q$ is given as (for each $s' \in S$),

$$b_j^q[s'] = \frac{\sum_{s \in S} b_{j-1}^q[s]\, T^a_{ss'}\, Z^a_{s'o}}{\Pr(o \mid b_{j-1}^q)} \tag{11}$$

Thus p(q) is computed as follows. We assign $b_0$ to $b_0^q$. For each non-zero j < t, we
calculate $\Pr(o_q^j \mid b_{j-1}^q)$ using eq. (10). If for any j we find that $\Pr(o_q^j \mid b_{j-1}^q)$ is 0, we set
p(q) to 0 and terminate. On the other hand, whenever $\Pr(o_q^j \mid b_{j-1}^q) > 0$, we compute
$b_j^q[s']$ for each state $s' \in S$ using eq. (11) and continue. The quantity $\mathcal{R}(q)$ is simply
the sum of the expected rewards the joint-actions in q obtain in the belief states $b_j^q$.
Assigning, as before, $b_0$ to $b_0^q$, and denoting $a_q^j$ by a,

$$\mathcal{R}(q) = \sum_{j=1}^{t} \sum_{s \in S} b_{j-1}^q[s]\, R^a_s \tag{12}$$
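The computation of v(q) described above is a single forward pass over the joint-sequence. The sketch below is a hedged illustration: the function name, the data layout and the two-state toy model are our assumptions, and the function returns 0 for $\mathcal{R}(q)$ when q cannot occur, since that quantity is then irrelevant.

```python
def joint_sequence_value(b0, acts, obs, T, Z, R):
    """Return (p(q), R(q), v(q)) for the joint-sequence made of the
    joint-actions acts[0..t-1] and the joint-observations obs[0..t-2]."""
    n_states = len(b0)
    belief = list(b0)                      # b^q_0 = b_0
    prob = 1.0                             # running product of eq. (10) terms
    reward = 0.0                           # running sum of eq. (12) terms
    for j, a in enumerate(acts):
        reward += sum(belief[s] * R[a][s] for s in range(n_states))
        if j == len(acts) - 1:
            break                          # no observation follows the last action
        o = obs[j]
        # Unnormalised next belief: sum_s b[s] T[a][s][s2] Z[a][s2][o]
        nxt = [sum(belief[s] * T[a][s][s2] for s in range(n_states)) * Z[a][s2][o]
               for s2 in range(n_states)]
        pr = sum(nxt)                      # eq. (10)
        if pr == 0.0:
            return 0.0, 0.0, 0.0           # q cannot occur, so v(q) = 0
        belief = [w / pr for w in nxt]     # eq. (11)
        prob *= pr
    return prob, reward, prob * reward


# Hypothetical toy model: 2 states, 2 agents, actions and observations in {0, 1}.
STATES = (0, 1)
JOINT_ACTIONS = [(a1, a2) for a1 in (0, 1) for a2 in (0, 1)]
JOINT_OBS = [(o1, o2) for o1 in (0, 1) for o2 in (0, 1)]
T = {a: [[1.0, 0.0], [0.0, 1.0]] for a in JOINT_ACTIONS}       # the state never changes
Z = {a: [{o: float(o == (s, s)) for o in JOINT_OBS} for s in STATES]
     for a in JOINT_ACTIONS}                                   # both agents observe the state
R = {a: [float(a == (s, s)) for s in STATES]
     for a in JOINT_ACTIONS}                                   # reward 1 if both actions equal the state

if __name__ == "__main__":
    b0 = [0.5, 0.5]
    print(joint_sequence_value(b0, [(0, 0), (0, 0)], [(0, 0)], T, Z, R))
    print(joint_sequence_value(b0, [(0, 0), (1, 1)], [(1, 1)], T, Z, R))
    print(joint_sequence_value(b0, [(0, 0), (0, 1)], [(0, 1)], T, Z, R))
```

In this toy model the first two joint-sequences each have value 0.75 and the third cannot occur, so it has value 0.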

Recall that fundamentally, a k-policy is just a set of sequences of different lengths.
Given a policy $\sigma$ of the ith agent, let the subset of $\sigma$ containing sequences of length t
be denoted by $\sigma^t$. Then given a joint-policy $\pi$, the set of joint-sequences of length t of
$\pi$ is simply the set $\pi_1^t \times \pi_2^t \times \dots \times \pi_n^t$. Note that if a joint-sequence q is in $\pi$, then
$\prod_{i=1}^{n} \pi_i[q_i] = 1$ and if it is not, then $\prod_{i=1}^{n} \pi_i[q_i] = 0$. We can now define the value of
a k-joint-policy $\pi$ in terms of the values of the joint-sequences. In particular, we need
consider only joint-sequences of length k. Thus,

$$V(\pi) = \sum_{q \in \mathcal{S}^k} v(q) \prod_{i=1}^{n} \pi_i[q_i] \tag{13}$$

The derivation of eq. (13) from eq. (4) is quite straightforward and is omitted.



5 Algorithm 

We shall now describe a mixed integer linear program (MILP) that finds an optimal
k-joint-policy. We start our description with the following naive mathematical program
(MP), which just implements the definition of an optimal k-joint-policy. This implies
finding for each agent i a vector $x_i \in \mathbb{R}^{|\mathcal{S}_i|}$ which satisfies the policy constraints of the
ith agent, such that the quantity $V(x_1, x_2, \dots, x_n)$ is maximized. Letting $x = (x_1, x_2, \dots, x_n)$,
the naive MP, denoted by MP-Dec, is as follows,

$$\text{maximize} \quad f(x) = \sum_{q \in \mathcal{S}^k} v(q) \prod_{i=1}^{n} x_i[q_i] \tag{14}$$

$$\text{s.t.} \quad \forall i \in N: \quad C_i x_i = b_i \tag{15}$$

$$x_i \ge 0 \tag{16}$$

An optimal solution to MP-Dec would yield an optimal (possibly stochastic)
k-joint-policy x. However, since f(x) is a nonconcave, nonlinear function, not only is solving
MP-Dec NP-hard, but more importantly, it is also not possible to guarantee finding a
globally optimal k-joint-policy. A simple fix to get rid of the nonconcave nonlinear
f(x) in MP-Dec is to linearize f(x), that is, to transform it into a linear
function. Linearization of f(x) is achieved by using more variables and more constraints in
addition to those in MP-Dec. The additional variables pertain to joint-sequences and the
additional constraints are required to relate the variables of joint-sequences to those of
sequences. The linearization of f(x) takes place in three steps. At the end of the three
steps, MP-Dec is converted to an integer linear program (ILP) on which the proposed
MILP is based.

5.1 Linearization of f(x): step 1



The simple idea in linearizing a nonlinear function is to use a variable for each
nonlinear term that appears in the function. In the case of f(x), the nonlinear terms are,
for each joint-sequence q of length k, $\prod_{i=1}^{n} x_i[q_i]$. Therefore, to replace the nonlinear
terms in f(x), we need to use a variable for every joint-sequence q of length k. Let
$y[q] \ge 0$ be the variable for q and let,

$$f(y) = \sum_{q \in \mathcal{S}^k} v(q)\, y[q] \tag{17}$$

So the first step in linearizing f(x) is to change the objective in MP-Dec to f(y) and
introduce the $|\mathcal{S}^k|$-vector $y \ge 0$ of variables in it. We denote this modified MP by
MP1-Dec.

5.2 Linearization of f(x): step 2

Once the objective is changed to f(y), we need to relate the variables representing
joint-sequences (y) to those representing agents' sequences (the $x_i$ vectors). In other
words, we need to add the following constraints to MP1-Dec,

$$\prod_{i=1}^{n} x_i[q_i] = y[q], \quad \forall q \in \mathcal{S}^k \tag{18}$$

But the constraints (18) are nonconvex. So, if they are added to MP1-Dec, it would
amount to maximizing a linear function under nonconvex, nonlinear constraints, and
again we would not have any guarantee of finding the globally optimal solution. We
therefore must also linearize these constraints. We shall do this in this step and the next.
Suppose that $(x_1, x_2, \dots, x_n)$ is a solution to MP1-Dec. Then, for each joint-sequence
q of length k, $\prod_{i=1}^{n} x_i[q_i]$ takes a value in [0,1]. In other words, it can take an infinite
number of values. We can limit the values it can take by requiring that the vectors $x_i$
be vectors of binary variables, 0 or 1. Moreover, since we want $\prod_{i=1}^{n} x_i[q_i]$ to equal
y[q], but want to avoid the constraints (18), we should also require that each y variable
be a binary variable. Thus, the second step in linearizing f(x) is to add the following
constraints to MP1-Dec:

$$x_i[p] \in \{0, 1\}, \quad \forall i \in N,\ \forall p \in \mathcal{S}_i \tag{19}$$

$$y[q] \in \{0, 1\}, \quad \forall q \in \mathcal{S}^k \tag{20}$$

Note that with these constraints in MP1-Dec, $x_i$ would represent a deterministic
k-policy of the ith agent. Constraints (19)-(20) are called integer constraints. We denote
the MP formed by adding integer constraints to MP1-Dec by MP2-Dec.

5.3 Linearization of f(x): step 3 

This is the key step in the linearization. The number of sequences of length k in a k-policy
of the ith agent is $\tau_i = |\Omega_i|^{k-1}$. Hence the number of joint-sequences of length k
in a k-joint-policy is $\tau = \prod_{i=1}^{n} \tau_i$. Let $\tau_{-i} = \tau / \tau_i$. Now suppose $(x_1, x_2, \dots, x_n)$ is a
solution to MP2-Dec. Each $x_i$ is a k-step deterministic policy of the ith agent. The
k-joint-policy formed by them is also deterministic. If for a sequence p of length k of the
ith agent, $x_i[p] = 1$, then for exactly $\tau_{-i}$ joint-sequences q of length k in which the
sequence of the ith agent is p, $\prod_{j=1}^{n} x_j[q_j] = 1$. On the other hand, if $x_i[p] = 0$, then
for each joint-sequence q in which the sequence of the ith agent is p, $\prod_{j=1}^{n} x_j[q_j] = 0$.
This can be represented mathematically as,

$$\sum_{q \in \mathcal{S}^k :\, q_i = p}\ \prod_{j=1}^{n} x_j[q_j] = \tau_{-i}\, x_i[p], \quad \forall i \in N,\ \forall p \in \mathcal{S}_i^k \tag{21}$$

The set of equations (21) is true for every k-step deterministic joint-policy, and it allows
us to linearize the constraints (18). All we have to do is to add the following set of linear
constraints to MP2-Dec,

$$\sum_{q \in \mathcal{S}^k :\, q_i = p} y[q] = \tau_{-i}\, x_i[p], \quad \forall i \in N,\ \forall p \in \mathcal{S}_i^k \tag{22}$$

If these constraints are added to MP2-Dec then the following holds,

$$\prod_{j=1}^{n} x_j[q_j] = y[q], \quad \forall q \in \mathcal{S}^k \tag{23}$$

because the r.h.s. of their corresponding equations are equal. Thus, we have achieved
the linearization of the constraints (18) and therefore of f(x). We shall call the
constraints (22) the joint-policy constraints. The MP obtained on adding the
joint-policy constraints to MP2-Dec gives us the integer linear program ILP-Dec, on which
the mixed ILP (MILP), the main contribution of this paper, is based. We give ILP-Dec
below for the sake of completeness.
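The counting argument behind (21) and (22) can be checked by brute force on a small instance. The sketch below is illustrative only: the function name, the string encoding of sequences and the two-agent example are our assumptions. It sets y[q] to the product of the agents' variables and tests whether every joint-policy constraint holds.

```python
from itertools import product
from math import prod


def joint_policy_constraints_hold(universes, policies, num_obs, k):
    """universes[i]: all length-k sequences of agent i (the set S_i^k);
    policies[i]: the length-k sequences selected by agent i;
    num_obs[i]: |Omega_i|. Returns True if (22) holds for every i and p."""
    n = len(universes)
    tau = [m ** (k - 1) for m in num_obs]                  # tau_i = |Omega_i|^(k-1)
    chosen = [set(pol) for pol in policies]
    x = [{p: int(p in chosen[i]) for p in universes[i]} for i in range(n)]
    # y[q] is set to the product of the agents' variables, as demanded by (18).
    y = {q: prod(x[i][q[i]] for i in range(n)) for q in product(*universes)}
    for i in range(n):
        tau_minus_i = prod(tau) // tau[i]
        for p in universes[i]:
            lhs = sum(val for q, val in y.items() if q[i] == p)
            if lhs != tau_minus_i * x[i][p]:               # constraint (22)
                return False
    return True


# Hypothetical example: 2 agents, k = 2, actions {a, b}, observations {u, v}.
S2 = [a1 + o + a2 for a1 in "ab" for o in "uv" for a2 in "ab"]   # S_i^2, 8 sequences

if __name__ == "__main__":
    # Two valid deterministic 2-policies (their length-2 sequences only).
    print(joint_policy_constraints_hold([S2, S2], [["aua", "avb"], ["bub", "bva"]], [2, 2], 2))
    # Agent 1 selects too few sequences, so this is not a policy.
    print(joint_policy_constraints_hold([S2, S2], [["aua"], ["bub", "bva"]], [2, 2], 2))
```

The check is exponential in the number of agents and is meant only to illustrate the identity, not to be used inside a solver.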

5.4 Integer linear program ILP-Dec 

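1. Variables:

(a) A $|\mathcal{S}^k|$-vector of variables, y.

(b) For each agent $i \in N$, an $|\mathcal{S}_i|$-vector of variables, $x_i$.

2. Objective:

$$\text{maximize} \quad f(y) = \sum_{q \in \mathcal{S}^k} v(q)\, y[q] \tag{24}$$

3. Constraints: for each agent $i \in N$,

(a) Policy constraints:

$$\sum_{a \in A_i} x_i[a] = 1 \tag{25}$$

$$x_i[p] - \sum_{a \in A_i} x_i[p o_i a] = 0, \quad \forall t \in \{1, 2, \dots, k-1\},\ \forall p \in \mathcal{S}_i^t,\ \forall o_i \in \Omega_i \tag{26}$$

(b) Joint-policy constraints: for each $p \in \mathcal{S}_i^k$,

$$\sum_{q \in \mathcal{S}^k :\, q_i = p} y[q] = \tau_{-i}\, x_i[p] \tag{27}$$

4. Integer constraints:

$$x_i[p] \in \{0, 1\}, \quad \forall i \in N,\ \forall p \in \mathcal{S}_i \tag{28}$$

$$y[q] \in \{0, 1\}, \quad \forall q \in \mathcal{S}^k \tag{29}$$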


We thus have the following result. 

Theorem 1 An optimal solution $(x_1, x_2, \dots, x_n)$ to ILP-Dec yields an optimal
k-joint-policy for the given Dec-Pomdp. (Proof is omitted.)

5.5 Mixed integer linear program MILP-Dec 

An ILP is so called because it is an LP whose variables are constrained to take integer
values. In ILP-Dec, each variable can be either 0 or 1. The principal method for
solving an integer linear program is branch and bound. So when solving ILP-Dec, a
tree of LPs is solved in which each LP is identical to ILP-Dec but in which the
integer constraints are replaced by non-negativity constraints (i.e., all the variables are
allowed to take real values greater than or equal to 0). In general, the fewer the
integer variables in an LP, the faster a solution will be obtained. Therefore it is
desirable to minimize the number of integer variables in an LP. An LP in which some
variables are allowed to take real values while the remaining ones are constrained to
be integers is called a mixed ILP (MILP). Thus, an MILP may be solved faster than an
ILP of the same size. We say that an MILP is equivalent to an ILP if every solution
to the MILP is also a solution to the ILP. An MILP that is equivalent to ILP-Dec can
be conceived as follows. Let this MILP be denoted by MILP-Dec. Let MILP-Dec be
identical to ILP-Dec in all respects except the following: in each vector $x_i$, only those
variables representing sequences of length k are constrained to take integer values 0 or
1; all the other variables in each $x_i$ and all the variables in the vector y are allowed
to take real values greater than or equal to 0. Due to the equivalence, we have the
following result.

Theorem 2 An optimal solution $(x_1, x_2, \dots, x_n)$ to MILP-Dec yields an optimal
k-joint-policy for the given Dec-Pomdp.

The proof of this theorem (and of the claim that MILP-Dec is equivalent to ILP-Dec) is 
omitted due to lack of space. The discussion henceforth applies to ILP-Dec as well. 
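To give a feel for how much the relaxation reduces the integer part of the program, the sketch below counts the 0/1 variables of ILP-Dec and of MILP-Dec from the set sizes given earlier. The function names and the example sizes (two agents, 3 actions and 2 observations each) are assumptions chosen for illustration.

```python
from math import prod


def seq_count(num_actions: int, num_obs: int, t: int) -> int:
    """|S_i^t| = |A_i|^t * |Omega_i|^(t-1)."""
    return num_actions ** t * num_obs ** (t - 1)


def integer_variable_counts(num_actions, num_obs, k):
    """Return the number of 0/1 variables in (ILP-Dec, MILP-Dec).

    num_actions[i] and num_obs[i] are |A_i| and |Omega_i| of agent i."""
    agents = list(zip(num_actions, num_obs))
    x_all = sum(seq_count(a, o, t) for a, o in agents for t in range(1, k + 1))
    x_last = sum(seq_count(a, o, k) for a, o in agents)      # length-k sequences only
    y_all = prod(seq_count(a, o, k) for a, o in agents)      # |S^k|
    return x_all + y_all, x_last


if __name__ == "__main__":
    for k in (2, 3, 4):
        print(k, integer_variable_counts([3, 3], [2, 2], k))
```

The total number of variables is the same in both programs; only the number that the solver must branch on differs.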
6 Improving MILP-Dec



We now discuss two heuristics for improving the space and time requirements of
formulating and solving MILP-Dec.

6.1 Identifying dominated sequences 

The number of variables required in the MILP-Dec can be minimized by using variables 
for only those sequences of each agent that are not dominated. Dominated sequences 
need not be represented in the MILP-Dec because there always exists an optimal k- 
joint-policy in which none of the policies contains a dominated sequence. We first 
define dominated sequences of length n. Given sequences p and p' of length k of the 
zth agent, p' shall be called a co-sequence of p if it is identical to p except for its last 
action. Let C(p) denote the set of co-sequences of p. Then, p is said to be dominated if 
there exists a probability distribution 8 over C(p), such that for every joint-sequence q 
of length k in which the sequence of the ith agent is p, the following is true: 

"(q)< E *(pW) ( 3 °) 
p'ec(p) 

in which q' = (q_1, ..., p', q_{i+1}, ..., q_n). Dominated sequences of length k can 
be identified through iterated elimination. Identifying dominated sequences of lengths less than k 
is easier. A sequence p of length t is a descendant of a sequence p'' of length j < t 
if the first j actions and j - 1 observations in p are identical to the j actions and j - 1 
observations in p''. A sequence p'' of length j is dominated if every descendant of p'' 
is dominated. So, for each agent, we first identify dominated sequences of length k, 
and then, working backwards, we identify dominated sequences of lengths less than k. 
Note that if dominated sequences are not represented by variables in MILP-Dec, then in 
each joint-policy constraint the = sign must be replaced by the ≤ sign. The MILP that 
results when dominated sequences of all the agents are not represented by variables in 
MILP-Dec and the above modification is made shall be denoted by MILP-Pr-Dec. 
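
The backward step for sequences shorter than k can be illustrated as follows. This is a minimal sketch, not the authors' implementation: it assumes that the dominated sequences of length k have already been found (for instance by the iterated elimination mentioned above), that a sequence is a tuple of actions interleaved with observations, and that the function names are hypothetical. It checks only the immediate children of each sequence, which is equivalent to checking every descendant because the pass runs from length k - 1 down to length 1.

```python
def seq_length(seq):
    """Number of actions in a tuple (a1, o1, a2, ..., a_t)."""
    return (len(seq) + 1) // 2

def propagate_dominated(all_seqs, dominated_k, k):
    """Return all dominated sequences, given the dominated ones of length k.

    A sequence of length j < k is marked dominated when every one of its
    children (its descendants of length j + 1) is already marked dominated.
    """
    dominated = set(dominated_k)
    for j in range(k - 1, 0, -1):
        for p in [s for s in all_seqs if seq_length(s) == j]:
            children = [s for s in all_seqs
                        if seq_length(s) == j + 1 and s[:len(p)] == p]
            if children and all(c in dominated for c in children):
                dominated.add(p)
    return dominated

if __name__ == "__main__":
    # Toy agent: actions a, b; one observation x; horizon k = 2.
    seqs = [("a",), ("b",),
            ("a", "x", "a"), ("a", "x", "b"),
            ("b", "x", "a"), ("b", "x", "b")]
    dom_k = {("a", "x", "a"), ("a", "x", "b"), ("b", "x", "a")}
    print(sorted(propagate_dominated(seqs, dom_k, 2) - dom_k))
```

In the toy example, the length-1 sequence `("a",)` becomes dominated because both of its children are, whereas `("b",)` survives because `("b", "x", "b")` is not dominated.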

6.2 Adding bounds into MILP-Dec 

The MILP solver can be guided in its path selection in the tree of LP problems, or 
made to terminate as early as possible, by providing lower and/or upper bounds on the 
objective function. In this paper, we wish to illustrate the importance of integrating 
bounds in MILP-Dec, and so we have used rather loose bounds. Given V(t), the value 
of an optimal t-joint-policy, a lower bound on the value of the optimal (t + 1)-joint-policy 
is 

$$\ell = V(t) + \max_{a \in A} \min_{s \in S} R(s, a) \qquad (31)$$
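
As a worked illustration of this lower bound, the sketch below assumes that (31) reads V(t) plus the best worst-case immediate reward over joint-actions, and that the reward function is available as a table indexed by state and joint-action. Both the reading of the formula and the name `lower_bound` are assumptions made for the example.

```python
def lower_bound(v_t, reward):
    """Lower bound on the value of an optimal (t + 1)-joint-policy.

    v_t    -- value of an optimal t-joint-policy
    reward -- reward[s][a]: immediate reward of joint-action a in state s
              (hypothetical table layout, chosen for this example)
    """
    n_actions = len(reward[0])
    # For each joint-action take its worst-case reward over states,
    # then keep the joint-action whose worst case is best.
    best_worst_case = max(min(row[a] for row in reward)
                          for a in range(n_actions))
    return v_t + best_worst_case

if __name__ == "__main__":
    toy_reward = [[1.0, -5.0],   # state 0
                  [0.0, 2.0]]    # state 1
    print(lower_bound(3.0, toy_reward))
```

In the toy table, joint-action 0 has worst-case reward 0 and joint-action 1 has worst-case reward -5, so the bound is 3 + 0 = 3.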

For an upper bound, the value u of an optimal k-step policy of the Pomdp corresponding 
to the Dec-Pomdp can be used. This value can be determined by the linear program 
(32)-(35), which also finds the optimal k-step policy for the Pomdp. Let S^t denote 
                          MABC                  MA-tiger
Algorithm          k = 3   k = 4   k = 5     k = 3   k = 4
MILP-Dec           0.86    900     -         3.7
MILP-Dec(u)        1.03    907               3.5
MILP-Dec(ℓ)        0.93    900               4.9     72
MILP-Pr-Dec        0.84    80                6.4
MILP-Pr-Dec(u)     0.93    10.2    25        6.2
MILP-Pr-Dec(ℓ)     0.84    120               7.6     175
DP                 5       10 a
MAA*               t_s     t_h               t_s     t_h
PBDP               1.0     2.0     10^5      t_s     t_h
DP-JESP                                              0.02
Approx-DP                                    0.05    1.0
MBDP               0.01    0.01    0.02      0.46    0.72

Table 1: Comparison of the runtimes in seconds of Dec-Pomdp algorithms. t_s denotes 
several seconds and t_h denotes several hours. "•" denotes a time-out of 30 minutes, "-" 
denotes insufficient memory and blank denotes that the application of the concerned 
algorithm to the concerned problem does not appear in the literature. 



the set of joint-sequences of length t. Let qoa denote the joint-sequence obtained on 
appending the joint-observation o and the joint-action a to the joint-sequence q. 

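$$\text{maximize} \quad u = \sum_{q \in S^k} \nu(q)\, y[q] \quad \text{s.t.:} \qquad (32)$$

$$\sum_{a \in A} y[a] = 1 \qquad (33)$$

$$y[q] - \sum_{a \in A} y[qoa] = 0, \quad \forall\, t < k,\; q \in S^t,\; o \in \Omega \qquad (34)$$

$$y \ge 0 \qquad (35)$$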

A bound is added to MILP-Dec as an additional constraint: the constraint f(y) ≥ ℓ 
adds the lower bound and the constraint f(y) ≤ u adds the 
upper bound. 



7 Experiments 

We formulated the MABC problem and the MA-tiger problem as MILPs and solved them 
using the ILOG Cplex 10 solver on an Intel P4 machine with a 3.40 GHz processor 
and 2.0 GB of RAM. The runtime in seconds of MILP-Dec and MILP-Pr-Dec 
for different values of k is shown in Table 1. In the first column, a parenthesis, if 
present, indicates which bound is used. The runtime includes the time taken to identify 
dominated sequences and to compute the bound (e.g., to solve the LP for the Pomdp), 
where applicable. We have also listed the runtimes of existing exact and approximate 
dynamic programming Dec-Pomdp algorithms as reported in the literature. The three 
exact algorithms are DP, MAA* and PBDP. The approximate algorithms are DP-JESP 
[8], Approximate-DP and MBDP. As far as dominated sequences are concerned, the 
MABC problem had about 75% dominated sequences per agent for k = 5, while 
MA-tiger had no dominated sequences for any horizon. 

8 Discussion and future directions 

In this paper we have introduced a new exact algorithm for solving finite-horizon 
Dec-Pomdps. The results in Table 1 show a clear advantage of the MILP algorithms 
over existing exact algorithms for the longest horizons considered in each problem. We 
now point out three directions in which this work can be extended. 

    Approximate algorithm: Our approach could be a good candidate for constructing 
    an approximate algorithm. For instance, if MILP-Dec or one of its variants is 
    able to solve a problem optimally for horizon k very quickly, then it can be 
    used as a ratchet for solving approximately for longer horizons in divisions of 
    k steps. Our initial experiments with this simple method on the MABC and 
    MA-tiger problems indicate that it may be comparable, in runtime and in value 
    of the joint-policy found, with current approximate algorithms for solving long 
    horizons (50, 100). This is particularly useful when the Dec-Pomdp problem 
    cycles back to the original state in a few steps. In the MA-tiger problem, for 
    example, upon the execution of the optimal 3-step joint-policy, denoted by σ^3, 
    the process returns to its initial belief state. The value of σ^3 is 5.19. So 
    we can perpetually execute σ^3 to get, in m steps, a total expected reward of 
    5.19m/3. Now, the value of σ^2, the optimal 2-step joint-policy, is −2. For 
    controlling the MA-tiger problem for m steps, we may either (a) execute σ^3 
    m/3 times or (b) execute σ^2 m/2 times. The loss for doing (b) instead of (a) would 
    be 5.19/3 + 2/2 = 2.73 per step, that is, 2.73m over m steps. This can be made 
    arbitrarily high by changing the reward 
    function. In other words, finding σ^3 is much more important than finding σ^2. We 
    can arrange for a similar difference in quality between σ^4 and σ^3, and MILP-Dec 
    is able to find σ^4 in 72 seconds while other algorithms take hours. Thus, the role 
    of an exact, fast algorithm, such as ours, may prove crucial even for very small 
    problems. 
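
The arithmetic behind this comparison can be checked with a short sketch. It uses only the two values quoted in the text (5.19 for the optimal 3-step joint-policy and −2 for the optimal 2-step joint-policy) and assumes, as the text does, that each joint-policy can be repeated back to back; the function names are hypothetical.

```python
def per_step_value(policy_value, policy_length):
    """Average expected reward per step when a policy is repeated forever."""
    return policy_value / policy_length

def total_loss(m, better, worse):
    """Expected reward lost over m steps by repeating `worse` instead of `better`.

    `better` and `worse` are (value, length) pairs.
    """
    return m * (per_step_value(*better) - per_step_value(*worse))

if __name__ == "__main__":
    sigma3 = (5.19, 3)   # optimal 3-step joint-policy of MA-tiger
    sigma2 = (-2.0, 2)   # optimal 2-step joint-policy of MA-tiger
    loss_per_step = per_step_value(*sigma3) - per_step_value(*sigma2)
    print(round(loss_per_step, 2))           # per-step loss
    print(round(total_loss(60, sigma3, sigma2), 1))  # loss over 60 steps
```

Repeating the 3-step policy earns 1.73 per step and repeating the 2-step policy earns −1 per step, so the gap is 2.73 per step and grows linearly with the number of steps m.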

    Dynamic programming: In formulating MILP-Dec we are required to first 
    generate the set S_i^k for each agent i. The size of this set is exponential in k. The 
    generation of this set acts as the major bottleneck for formulating MILP-Dec in 
    memory. However, we can use dynamic programming to create each set S_i^k 
    incrementally in a backward fashion. Such a procedure does not require the 
    knowledge of b_0 and it is based on the same principle as the DP algorithm. In 
    brief, the procedure is as follows. For each nonzero t ≤ k, we generate 
    for each agent a set of sequences of length t by doing a backup of a previously 
    generated set of sequences of length t - 1 of the agent. We then compute, for 
    each joint-sequence of length t, an |S|-vector containing the values of the 
    joint-sequence when the initial belief state is one of the states in S. We then prune, for 
    each agent, sequences of length t that are dominated over the belief space formed by 
    the cross-product of S and the set of joint-sequences of length t. By starting out 
    with the set S_i^1 (which is in fact just the set A_i) for each agent i, we can 
    incrementally build the set S_i^k. Note that a backup of the set S_i^t creates |A_i||Ω_i||S_i^t| 
    new sequences; i.e., the growth is linear. In contrast, the backing-up of a set of 
    policies represents an exponential growth. The merit of this procedure is that 
    we may be able to compute an optimal joint-policy for a slightly longer horizon. 
    But more importantly, due to the linear growth of sequences in each iteration, it 
    may be possible to solve for the infinite horizon by iterating until some stability 
    or convergence in the values of joint-sequences is realized. 
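
The contrast between the two kinds of backup can be made concrete with a small counting sketch. It assumes no pruning at all, takes the sequence backup count |A_i||Ω_i||S_i^t| from the text, and uses the standard count for an exhaustive backup of policy trees (one root action and one subtree per observation); the function names are hypothetical.

```python
def backed_up_sequences(n_actions, n_obs, n_seqs):
    """Sequences created by one backup: linear in the current set size."""
    return n_actions * n_obs * n_seqs

def backed_up_policies(n_actions, n_obs, n_policies):
    """Policy trees created by one exhaustive backup: one root action and
    one subtree per observation, hence a power of the current set size."""
    return n_actions * n_policies ** n_obs

def growth(n_actions, n_obs, horizon):
    """Set sizes for lengths 1..horizon, starting from the action set."""
    seqs, pols = [n_actions], [n_actions]
    for _ in range(horizon - 1):
        seqs.append(backed_up_sequences(n_actions, n_obs, seqs[-1]))
        pols.append(backed_up_policies(n_actions, n_obs, pols[-1]))
    return seqs, pols

if __name__ == "__main__":
    seqs, pols = growth(n_actions=2, n_obs=2, horizon=4)
    print("sequences:", seqs)
    print("policies: ", pols)
```

For an agent with two actions and two observations, four backups' worth of lengths give 2, 8, 32, 128 sequences against 2, 8, 128, 32768 policy trees, which is the gap the paragraph above refers to.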

    Pomdps: Finally, the approach consisting of the use of the sequence-form and 
    mathematical programming could be applied to Pomdps. We have already shown 
    in this paper how a finite-horizon Pomdp can be solved. In conjunction with a 
    dynamic programming approach analogous to the one described above, it may be 
    possible to compute the infinite-horizon discounted value function of a Pomdp. 